Users/kirankk/cosmos refresh reason#5791
Users/kirankk/cosmos refresh reason#5791kirankumarkolli wants to merge 5 commits intomsdata/directfrom
Conversation
Adds a design-time-bounded enum + wire-value table so every SDK-forced
cache refresh can be attributed to a specific cause via the generic
`x-ms-cosmos-refresh-reason` HTTP header. This is the foundation for
tagging address-cache refreshes (and, in future PRs, PK-range-cache,
collection-routing-map, etc.).
Phase 1 scope:
* `RefreshReason` internal enum with 23 explicitly-numbered members
(Groups A-D covering server 410, Gone+substatus, transport-synthesized
Gone, and non-Gone forced refreshes).
* `RefreshReasonExtensions`:
- `WireValues` dictionary (single source of truth for enum -> wire
string; values flat, <=2 dot-segments).
- `ToHeaderValue` for serialization.
- `FromTransportErrorCode` exhaustive switch mapping every
`TransportErrorCode` to a specific reason; default -> `GoneUnknown`.
* Header constant `HttpConstants.HttpHeaders.CosmosRefreshReason` in
`src/direct/` (pending upstream `Microsoft.Azure.Cosmos.Direct`
parallel PR).
* Carrier field `DocumentServiceRequestContext.RefreshReason` (default
`Unspecified`) copied by `Clone()`.
* 12 unit tests validating enum/wire-value coverage, uniqueness, regex
conformance, and exhaustive transport-code mapping.
No egress wiring or call-site tagging yet - those land in Phase 2/3.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Plumbs the RefreshReason through the two address-cache force-refresh
egress functions and emits the `x-ms-cosmos-refresh-reason` header
when the reason resolves to non-Unspecified. No call-site tagging yet
(that lands in Phase 3); this commit only adds the wiring + test-only
validator so Phase 3 edits are mechanical.
Changes (Microsoft.Azure.Cosmos/src/Routing/GatewayAddressCache.cs):
* Static test-only toggle `ValidateRefreshReasonPresence` (default
`false`; tests set `true`). When enabled, a forced refresh with
`Unspecified` reason throws `InvalidOperationException` so any
future untagged force-refresh site is caught automatically.
* New private helper `EmitRefreshReasonHeader` centralizes reason
resolution (`explicitReason` precedence over
`request.RequestContext.RefreshReason`), validator check, and
header emission.
* `GetMasterAddressesViaGatewayAsync` + `GetServerAddressesViaGatewayAsync`
take an optional `RefreshReason explicitReason = Unspecified` and
call the helper when `forceRefresh` is true.
* `GetAddressesForRangeIdAsync` takes the same optional parameter
and forwards it; this is the path used by the on-demand
unhealthy-URI revalidation site (to be tagged in Phase 3 as
`ReplicaHealthUnhealthyLongLived`) which does not own the
originating `DocumentServiceRequest`.
Zero production overhead: validator is a single bool read in the
`forceRefresh` branch, default-off.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
There was a problem hiding this comment.
Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"
Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.
Tag every call site that flips ForceRefreshAddressCache = true (or equivalent background refresh) with the originating RefreshReason so the outgoing /addresses request carries x-ms-cosmos-refresh-reason. Sites tagged: - GatewayAddressCache suboptimal-server-partition timer -> InsufficientReplicasSuboptimalTimer - GatewayAddressCache unhealthy-URI background refresh -> ReplicaHealthUnhealthyLongLived (via explicitReason) - GatewayAddressCache master-suboptimal trigger -> InsufficientReplicasSuboptimalTimer - ConsistencyWriter barrier request -> InsufficientReplicasQuorum - QuorumReader barrier request -> InsufficientReplicasQuorum - StoreReader Gone-retry paths (2) -> classified via new ClassifyPriorGone helper walking prior StoreResult exception chain (TransportException.ErrorCode for transport-synth 410, else substatus mapping, else GoneServer/GoneUnknown) New helper: - RefreshReasonExtensions.ClassifyGoneFromException: centralized exception+substatus -> RefreshReason mapping Tagging preserves upstream-set reasons by only writing when still Unspecified. Build green (0 warn / 0 err); 12 unit tests still pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
Full plan document (problem statement, design choices, 23-value enum reference with groupings, file inventory covering this repo and upstream Direct, test surface, phase status, risks) so the PR carries the design context inline. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
- Factor FromTransportErrorCode into an internal TryMapKnownTransportErrorCode helper that returns false for unhandled codes. The public API still falls back to GoneUnknown for unknown codes (graceful in production), but the exhaustive unit test now uses TryMap directly, so any new upstream TransportErrorCode will fail CI instead of silently folding to GoneUnknown. - Promote GatewayAddressCache.EmitRefreshReasonHeader from private to internal so the validator/precedence behavior can be unit-tested without a full gateway round-trip. - Add 7 ClassifyGoneFromException tests covering every SubStatusCodes branch (CompletingSplit, CompletingPartitionMigration, NameCacheIsStale, PartitionKeyRangeGone) and the default GoneServer fallback. - Add RefreshReasonEmissionTests with 6 tests for precedence (explicit > ctx), null-request path (on-demand unhealthy-URI refresh), default no-op when untagged, and the opt-in ValidateRefreshReasonPresence invariant that guards future force-refresh call sites from shipping untagged. Tests: 25/25 Routing/RefreshReason pass; 22/22 GatewayAddressCache pass. Build: clean (0 warn / 0 err). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Azure Pipelines: 1 pipeline(s) were filtered out due to trigger conditions. |
|
|
||
| ## 1. Problem | ||
|
|
||
| Every forced address-cache refresh makes the SDK call `GET …/addresses?…` with |
There was a problem hiding this comment.
What it does not have is context on what the connect was to (namely which endpoint(s) failed.
Should we from the start, support delimiting the header so we can suffix additional information.
So something like this
x-ms-cosmos-refresh-reason: gone.connect; endpoints=rntbd://cdb-ms-prod-eastus1-be207.documents.azure.com:14464/
Pull Request Template
Description
Please include a summary of the change and which issue is fixed. Include samples if adding new API, and include relevant motivation and context. List any dependencies that are required for this change.
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber